WEBVTT

00:00.560 --> 00:06.230
Hello everybody and welcome back to the session about visualization of NumPy arrays and regression.

00:06.740 --> 00:12.730
So visualization of NumPy arrays works in the very same way as with lists.

00:12.830 --> 00:19.230
So this should be a repetition of what we already know about line plots, scatter plots and histograms,

00:19.640 --> 00:22.330
and we will also have a look at regression analysis.

00:22.360 --> 00:30.530
It works very easily with NumPy, and we can perform not only linear but also polynomial regression.

00:30.530 --> 00:35.730
And last but not least, we try to approximate an assumed functional relationship

00:35.740 --> 00:42.590
in a given set of data points with the Taylor series expansion technique. That is definitely a more

00:43.010 --> 00:48.260
advanced topic in this course, but we will see it later in detail and hopefully understand how it

00:48.260 --> 00:49.190
works.

00:49.190 --> 00:49.570
All right.

00:49.580 --> 00:52.500
So let's first of all import NumPy,

00:53.330 --> 00:59.690
and then we import matplotlib.pyplot as plt, and we have to set %matplotlib inline in a Jupyter

00:59.690 --> 01:08.180
notebook, and we use the style seaborn. Then we create normally distributed random numbers:

01:08.370 --> 01:19.320
10,000 numbers with mean 5 and standard deviation 2. So we get an array y with 10,000 elements, and then

01:19.320 --> 01:28.710
we make a histogram. So we pass our ndarray y to the histogram, we set bins to a hundred, and

01:28.830 --> 01:34.380
we label the data "data". Then we enlarge the size of the figure with plt.figure,

01:34.380 --> 01:43.830
figsize equals (10, 6), and we set the title of our histogram to "Frequency distribution of y", and we

01:43.830 --> 01:50.970
call our x label "y" and our y label "frequency", and we also enable a legend in our histogram.

01:50.970 --> 01:56.520
So let's run the cell and see what we get here.

01:56.550 --> 02:02.910
So we get a histogram with normally distributed numbers, and it already looks like a bell-shaped curve.

02:02.910 --> 02:09.190
So we have 10,000 data points, and if you further increase this to a hundred thousand or 1 million,

02:09.190 --> 02:15.480
that would be a very good bell-shaped curve. We can also plot a vertical line into our

02:15.480 --> 02:23.370
graph, and the vertical line is at the value of the mean of our ndarray.

02:23.370 --> 02:25.780
So this gives us quite a good impression.

02:25.800 --> 02:31.710
So we determined in our random process that the mean should be 5, and the mean of our random

02:31.710 --> 02:34.730
numbers is actually roundabout 5.

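NOTE A minimal sketch of the histogram steps described above, not the notebook's actual cell. The fixed seed, the Agg backend and the red line color are assumptions made so the snippet runs as a plain script; in a Jupyter notebook, %matplotlib inline would take the place of the backend line.
```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; in Jupyter use %matplotlib inline
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)  # assumption: fixed seed for reproducibility
y = rng.normal(loc=5, scale=2, size=10_000)  # mean 5, standard deviation 2
fig, ax = plt.subplots(figsize=(10, 6))
counts, edges, _ = ax.hist(y, bins=100, label="data")
ax.axvline(y.mean(), color="red", label="mean")  # vertical line at the sample mean
ax.set_title("Frequency distribution of y")
ax.set_xlabel("y")
ax.set_ylabel("frequency")
ax.legend()
```
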
02:34.740 --> 02:35.150
All right.

02:35.160 --> 02:41.010
So there's a function called np.linspace, and what np.linspace does is create evenly spaced

02:41.010 --> 02:49.080
numbers over a specified interval. So we specify, let's press Shift+Tab, we specify here our start

02:49.080 --> 02:56.040
point, that's 1, our stop point, that's 10, and then we say we want to have 50 evenly spaced numbers between

02:56.040 --> 02:57.000
1 and 10.

02:57.030 --> 02:59.070
So let's see what we get here.

02:59.310 --> 03:06.660
So by default the number is 50, but we set it here to 10, so we get 10 evenly spaced numbers in the interval

03:06.960 --> 03:08.130
between 1 and 10.

03:08.130 --> 03:12.590
And we actually get one, two, three, four, five, six, seven, eight, nine, ten.

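NOTE A short sketch of the np.linspace call described above. The variable names ten and default are hypothetical and only serve the illustration.
```python
import numpy as np
ten = np.linspace(1, 10, 10)  # start, stop, num: both endpoints are included
default = np.linspace(1, 10)  # num is omitted, so it falls back to 50
print(ten)
print(default.size)
```
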
03:12.720 --> 03:19.080
So let's create 1,000 evenly spaced numbers in the interval minus 10 to 10 and call it x.

03:19.980 --> 03:24.290
So now we have a NumPy array x with 1,000 numbers.

03:24.540 --> 03:27.960
And this is very helpful when we want to plot the graph of a function.

03:27.960 --> 03:34.740
So with linspace we can create many x values that are close together, and then we can define a functional

03:34.740 --> 03:41.550
relationship over our x values and calculate y values depending on our x values and the

03:41.720 --> 03:46.730
functional relationship. So let's assume here we have our functional relationship.

03:46.730 --> 03:54.860
So y is three times x to the power of three, minus two times x to the power of two, minus five

03:54.920 --> 03:58.480
x, minus five.

03:58.490 --> 04:01.870
So now, as we already know, these are vectorized operations.

04:01.870 --> 04:07.950
So for each element of our ndarray x you get a corresponding y value.

04:07.960 --> 04:09.300
Here we have the array y.

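NOTE A minimal sketch of the vectorized evaluation described above, using the polynomial as it is spoken in the session. The names y_poly and y_sin are hypothetical; the sine array only shows that the same pattern works for another function.
```python
import numpy as np
x = np.linspace(-10, 10, 1000)  # 1,000 evenly spaced x values
y_poly = 3 * x**3 - 2 * x**2 - 5 * x - 5  # evaluated element-wise, no loop needed
y_sin = np.sin(x)  # NumPy functions are applied element-wise as well
# With matplotlib.pyplot imported as plt, plt.plot(x, y_poly) would draw the curve.
print(y_poly[:3])
```
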
04:09.860 --> 04:16.910
So now we have x and y values, and we can actually plot these in a two-dimensional graph. So

04:16.910 --> 04:18.460
we plot x and y,

04:22.150 --> 04:30.000
and we can see here our function which we defined here. We can also define another function,

04:30.000 --> 04:32.310
for example the sine function.

04:32.320 --> 04:35.080
So y is the sine of x.

04:38.240 --> 04:44.850
So now we are repeating the plotting. Now y stands for the sine of x,

04:45.710 --> 04:48.500
and we get the sine plot.

04:48.740 --> 04:51.240
All right, so far we had a histogram and a line plot.

04:51.250 --> 04:54.020
Let's go on with the scatter plot.

04:54.020 --> 05:02.480
We generate 20 normally distributed numbers with mean 10 and standard deviation 2 and reshape those

05:02.750 --> 05:05.940
into two rows of 10 columns.

05:06.650 --> 05:14.200
Let's call the matrix m. So we have two rows, row one and row two, each with

05:14.270 --> 05:15.410
10 elements.

05:15.410 --> 05:19.980
And then we slice our matrix, and we call our first row a

05:21.830 --> 05:26.250
and the second row b.

05:26.420 --> 05:30.300
Now we have two arrays with 10 elements each,

05:30.320 --> 05:32.530
and then we can make a scatter plot of a and b.

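NOTE A small sketch of how the two arrays for the scatter plot could be built. The names m, a and b follow the session; the seed is an assumption, so the values differ from the ones read off the plot in the session.
```python
import numpy as np
rng = np.random.default_rng(1)  # assumption: fixed seed for reproducibility
m = rng.normal(loc=10, scale=2, size=20).reshape(2, 10)  # 2 rows, 10 columns
a = m[0]  # first row, used as x values
b = m[1]  # second row, used as y values
# With matplotlib.pyplot imported as plt, plt.scatter(a, b) would draw the points.
print(a.shape, b.shape)
```
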
05:36.660 --> 05:38.370
OK, so let's see here.

05:38.370 --> 05:40.600
So let's find our first point.

05:40.930 --> 05:45.830
So a is on the x-axis and b is on the y-axis.

05:46.020 --> 05:53.460
And for example, let's see: for our first point, on the x-axis it's 7.8 and on the y-axis it's

05:53.460 --> 05:56.520
8.6, so 7.8 and 8.6.

05:56.520 --> 06:01.630
So this is this point here. And maybe another point:

06:01.830 --> 06:06.780
we have the point that is 12 on the x-axis and 9.8 on the y-axis,

06:09.540 --> 06:11.640
and this would be this point here.

06:11.820 --> 06:18.360
And what we can do now is a linear regression that fits a straight line into our graph here, and

06:18.360 --> 06:21.470
we can do this with the function np.polyfit.

06:21.480 --> 06:28.570
So we have to pass our two arrays a and b, and we have to set the degree of our polynomial.

06:28.600 --> 06:35.500
So 1 stands for a linear function, 2 for a quadratic function, and 3 stands for cubic,

06:35.500 --> 06:39.200
and so on, but we will see this later.

06:39.200 --> 06:45.680
So now let's run the cell. What np.polyfit returns is actually the polynomial

06:45.680 --> 06:54.440
coefficients, from the highest power down to the constant term, in an array. So we have here about minus 0.16 and 10.5,

06:54.470 --> 06:59.440
and this gives us the functional relationship of our linear regression.

06:59.480 --> 07:07.190
So the functional relationship here is b equals 10.51 minus 0.1599 times

07:07.310 --> 07:08.300
a.

07:08.300 --> 07:12.480
So this is the intercept of our function, and this is the first coefficient, the slope.

07:12.710 --> 07:18.740
All right, so now let's create some x and y values. We create x values with linspace, and we can

07:18.740 --> 07:27.050
create our y values with the function np.polyval. So np.polyval creates the y values given x

07:27.050 --> 07:36.180
values and the polynomial coefficients, which we calculated here with np.polyfit. So let's

07:36.180 --> 07:44.400
calculate both x and y values, and now we can plot our linear regression line. So we plot it in our

07:44.400 --> 07:51.990
graph, where we also have our scatter plot of the data points, and we plot in our linear regression line with

07:51.990 --> 07:56.870
our x and y values. So let's execute this.

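NOTE A minimal sketch of the linear fit with np.polyfit and np.polyval described above. The seed and the names coeffs, x_line and y_line are assumptions, so the printed coefficients will not match the 10.51 and minus 0.1599 from the session.
```python
import numpy as np
rng = np.random.default_rng(1)  # assumption: fixed seed for reproducibility
a, b = rng.normal(loc=10, scale=2, size=20).reshape(2, 10)
coeffs = np.polyfit(a, b, 1)  # ndarray [slope, intercept], highest power first
slope, intercept = coeffs
x_line = np.linspace(a.min(), a.max(), 100)  # x values spanning the data
y_line = np.polyval(coeffs, x_line)  # same as slope * x_line + intercept
# With matplotlib.pyplot imported as plt: plt.scatter(a, b); plt.plot(x_line, y_line)
print(f"b = {intercept:.2f} + ({slope:.4f}) * a")
```
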
08:00.400 --> 08:06.300
And we see the linear regression line. So a regression line is the line that best fits the data.

08:06.300 --> 08:07.140
What does it mean?

08:07.350 --> 08:08.630
So in very simple words:

08:08.650 --> 08:15.360
each point has a distance from its actual y value to the regression line, and what the

08:15.360 --> 08:21.930
regression line does is minimize, so to say, the sum of all distances. In detail, it minimizes

08:22.020 --> 08:24.850
the sum of all squared distances.

08:24.990 --> 08:26.660
So in easy words, it is just

08:26.670 --> 08:30.210
the line that best fits into our data.

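NOTE A small sketch of what "minimizes the sum of all squared distances" means in practice. The helper sse and the shifted line are hypothetical and exist only for this comparison; the seed is an assumption as well.
```python
import numpy as np
rng = np.random.default_rng(1)  # assumption: fixed seed for reproducibility
a, b = rng.normal(loc=10, scale=2, size=20).reshape(2, 10)
def sse(coeffs):
    """Sum of squared vertical distances between the points and the line."""
    return float(np.sum((b - np.polyval(coeffs, a)) ** 2))
fit = np.polyfit(a, b, 1)  # least-squares line
shifted = fit + np.array([0.05, 0.0])  # hypothetical line with a slightly different slope
print(sse(fit), sse(shifted))  # the fitted line has the smaller sum
```
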
08:30.210 --> 08:30.560
All right.

08:30.570 --> 08:36.820
So what we can do now is also make a quadratic regression by setting the degree to two.

08:38.160 --> 08:44.550
So let's run it here, and then we get the polynomial coefficients: 16.98 is

08:44.550 --> 08:51.660
the intercept, 0.05 is the coefficient of the highest power, and the functional relationship

08:51.720 --> 08:58.920
looks like this here. We can also calculate the cubic regression, so with polynomial degree 3,

09:01.050 --> 09:05.930
and this gives us a functional relationship between a and b in this manner here.

09:05.940 --> 09:14.010
So here are the polynomial coefficients, and then we can also plot our cubic and quadratic regression

09:14.010 --> 09:17.130
in our graph, that's already here.

09:20.440 --> 09:28.810
And let's plot both. So you can see that this is the graph for the quadratic regression, and the green

09:28.810 --> 09:35.440
graph here is the graph for the cubic regression. The more you increase the degree of the regression,

09:35.770 --> 09:38.080
the better the graph fits our data points.

09:38.230 --> 09:44.350
So the average absolute distance between our cubic graph and the data points is much lower than the

09:44.350 --> 09:49.570
average distance between the data points and our linear regression graph.

09:49.600 --> 09:55.420
And last but not least, we can have a perfect regression, where all the data points are on our regression

09:55.480 --> 09:55.900
graph.

09:55.930 --> 09:59.920
So the distance between our data points and the graph is zero

09:59.980 --> 10:07.690
and of course therefore minimized. In our case we have 10 data points, and we can reach a perfect

10:07.690 --> 10:11.800
regression with a polynomial degree of ten minus one,

10:11.800 --> 10:18.940
so with a degree of nine. So let's set up the perfect regression here: we call np.polyfit, we pass

10:18.970 --> 10:26.410
our arrays a and b, and we say our degree should be the length of a, or of b of course, minus 1.

10:28.330 --> 10:35.590
So now we get 10 polynomial coefficients, and now we can plot our data points and our perfect regression

10:35.590 --> 10:42.870
curve, and we can see the regression curve perfectly fits our data points. And you can see here

10:42.880 --> 10:45.380
that we have a different scale than here with our scatter plot.

10:45.400 --> 10:48.990
So here it's 9 to 14 on the y-axis,

10:49.000 --> 10:52.380
and here it's zero to 800.

10:52.390 --> 10:54.400
So therefore let's zoom in,

10:58.520 --> 11:04.930
and there you can see even better that our regression curve perfectly fits our data points.

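NOTE A hedged sketch comparing fits of degree 1, 2, 3 and n minus 1. The session uses np.polyfit; this sketch swaps in numpy.polynomial.Polynomial.fit, which rescales x internally, because high-degree fits with np.polyfit can be numerically ill-conditioned. The data values are made up for illustration and are not the session's random numbers.
```python
import numpy as np
from numpy.polynomial import Polynomial
a = np.linspace(6, 14, 10)  # made-up, evenly spaced x values
b = np.array([8.6, 9.8, 11.2, 7.9, 10.4, 12.1, 9.3, 10.9, 8.2, 11.5])  # made-up y values
sse = {}
for deg in (1, 2, 3, len(a) - 1):
    p = Polynomial.fit(a, b, deg)  # least-squares polynomial of the given degree
    sse[deg] = float(np.sum((b - p(a)) ** 2))  # sum of squared distances
print(sse)  # the sum shrinks as the degree grows; degree 9 passes through all 10 points
```
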
11:04.970 --> 11:12.090
So what is a perfect regression useful for? In our case we created two arrays a and b that are

11:12.230 --> 11:14.420
actually independent and randomly created.

11:14.420 --> 11:17.640
So there's definitely no functional relationship.

11:17.690 --> 11:24.860
And not only our perfect regression graph, all of our regression curves overfit the data, so they

11:24.860 --> 11:27.950
indicate a relationship where there's actually none.

11:27.980 --> 11:31.320
And overfitting is a real problem in data science.

11:31.520 --> 11:36.410
But on the other hand, if we feel sure that there's a functional relationship and we do not

11:36.410 --> 11:43.880
have the math skills or tools to actually express or find the functional relationship, then it might

11:43.880 --> 11:49.430
be a good idea to approximate this functional relationship with the Taylor series expansion. So

11:49.430 --> 11:53.280
what we did here works in the spirit of a Taylor series expansion: we approximate a function with a polynomial.

11:53.360 --> 11:58.070
All right, so now we are finished with the session and I hope you enjoyed it.

11:58.070 --> 11:59.630
And see you in the next session.
